Building a Rain Prediction Model Using the K-Nearest Neighbors Method

Authors

Romeo

Ruslan

Hana

Haifan

Ana

Zahra

Import Libraries


Code
import numpy as np # Numerical Computations
import pandas as pd # Data Preprocessing

# Import libraries for plotting
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px 
import plotly.io as pio
from plotly.subplots import make_subplots
import plotly.graph_objects as go

pio.renderers.default = "plotly_mimetype+notebook_connected"

Import Dataset


Code
df = pd.read_csv('weatherAUS.csv')
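The Date column in this dataset holds ISO-formatted dates. The following is a minimal sketch, assuming you want pandas to parse them while loading. The two-row inline CSV is a stand-in for weatherAUS.csv so the snippet runs on its own, and `sample_df` is a hypothetical name.

```python
import io
import pandas as pd

# Stand-in for weatherAUS.csv: two rows in the same layout as the real file
sample_csv = io.StringIO(
    "Date,Location,MinTemp,MaxTemp,RainTomorrow\n"
    "2008-12-01,Albury,13.4,22.9,No\n"
    "2008-12-02,Albury,7.4,25.1,No\n"
)

# parse_dates converts the Date column to datetime64 instead of object
sample_df = pd.read_csv(sample_csv, parse_dates=["Date"])
print(sample_df.dtypes)
```

With a datetime column, expressions such as `sample_df["Date"].dt.month` become available for later feature engineering.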

Data Exploration


Preview dataset

Code
df.head()
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RainTomorrow
0 2008-12-01 Albury 13.4 22.9 0.6 NaN NaN W 44.0 W ... 71.0 22.0 1007.7 1007.1 8.0 NaN 16.9 21.8 No No
1 2008-12-02 Albury 7.4 25.1 0.0 NaN NaN WNW 44.0 NNW ... 44.0 25.0 1010.6 1007.8 NaN NaN 17.2 24.3 No No
2 2008-12-03 Albury 12.9 25.7 0.0 NaN NaN WSW 46.0 W ... 38.0 30.0 1007.6 1008.7 NaN 2.0 21.0 23.2 No No
3 2008-12-04 Albury 9.2 28.0 0.0 NaN NaN NE 24.0 SE ... 45.0 16.0 1017.6 1012.8 NaN NaN 18.1 26.5 No No
4 2008-12-05 Albury 17.5 32.3 1.0 NaN NaN W 41.0 ENE ... 82.0 33.0 1010.8 1006.0 7.0 8.0 17.8 29.7 No No

5 rows × 23 columns

Note

The target variable is RainTomorrow.

View dimension of dataset

Code
df.shape
(145460, 23)

The dataset contains 145460 rows and 23 columns.

View column names

Code
col_names = df.columns
col_names
Index(['Date', 'Location', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation',
       'Sunshine', 'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm',
       'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',
       'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am',
       'Temp3pm', 'RainToday', 'RainTomorrow'],
      dtype='object')

Check data types of the attributes

Code
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           145460 non-null  object 
 1   Location       145460 non-null  object 
 2   MinTemp        143975 non-null  float64
 3   MaxTemp        144199 non-null  float64
 4   Rainfall       142199 non-null  float64
 5   Evaporation    82670 non-null   float64
 6   Sunshine       75625 non-null   float64
 7   WindGustDir    135134 non-null  object 
 8   WindGustSpeed  135197 non-null  float64
 9   WindDir9am     134894 non-null  object 
 10  WindDir3pm     141232 non-null  object 
 11  WindSpeed9am   143693 non-null  float64
 12  WindSpeed3pm   142398 non-null  float64
 13  Humidity9am    142806 non-null  float64
 14  Humidity3pm    140953 non-null  float64
 15  Pressure9am    130395 non-null  float64
 16  Pressure3pm    130432 non-null  float64
 17  Cloud9am       89572 non-null   float64
 18  Cloud3pm       86102 non-null   float64
 19  Temp9am        143693 non-null  float64
 20  Temp3pm        141851 non-null  float64
 21  RainToday      142199 non-null  object 
 22  RainTomorrow   142193 non-null  object 
dtypes: float64(16), object(7)
memory usage: 25.5+ MB
  • The dataset contains a mix of categorical and numerical variables.
  • Categorical variables have the object data type.
  • Numerical variables have the float64 data type.

View statistical properties of dataset

Code
df.describe()
MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustSpeed WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm
count 143975.000000 144199.000000 142199.000000 82670.000000 75625.000000 135197.000000 143693.000000 142398.000000 142806.000000 140953.000000 130395.00000 130432.000000 89572.000000 86102.000000 143693.000000 141851.00000
mean 12.194034 23.221348 2.360918 5.468232 7.611178 40.035230 14.043426 18.662657 68.880831 51.539116 1017.64994 1015.255889 4.447461 4.509930 16.990631 21.68339
std 6.398495 7.119049 8.478060 4.193704 3.785483 13.607062 8.915375 8.809800 19.029164 20.795902 7.10653 7.037414 2.887159 2.720357 6.488753 6.93665
min -8.500000 -4.800000 0.000000 0.000000 0.000000 6.000000 0.000000 0.000000 0.000000 0.000000 980.50000 977.100000 0.000000 0.000000 -7.200000 -5.40000
25% 7.600000 17.900000 0.000000 2.600000 4.800000 31.000000 7.000000 13.000000 57.000000 37.000000 1012.90000 1010.400000 1.000000 2.000000 12.300000 16.60000
50% 12.000000 22.600000 0.000000 4.800000 8.400000 39.000000 13.000000 19.000000 70.000000 52.000000 1017.60000 1015.200000 5.000000 5.000000 16.700000 21.10000
75% 16.900000 28.200000 0.800000 7.400000 10.600000 48.000000 19.000000 24.000000 83.000000 66.000000 1022.40000 1020.000000 7.000000 7.000000 21.600000 26.40000
max 33.900000 48.100000 371.000000 145.000000 14.500000 135.000000 130.000000 87.000000 100.000000 100.000000 1041.00000 1039.600000 9.000000 9.000000 40.200000 46.70000

Explore Categorical Variables

Code
categorical = df.select_dtypes(include=['object'])
categorical.head()
Date Location WindGustDir WindDir9am WindDir3pm RainToday RainTomorrow
0 2008-12-01 Albury W W WNW No No
1 2008-12-02 Albury WNW NNW WSW No No
2 2008-12-03 Albury WSW W WSW No No
3 2008-12-04 Albury NE SE E No No
4 2008-12-05 Albury W ENE NW No No
  • There are 6 categorical variables in the dataset: Location, WindGustDir, WindDir9am, WindDir3pm, RainToday, and RainTomorrow. Date is also stored as object, but it is a date rather than a category.
  • Two of them are binary: RainToday and RainTomorrow.
  • RainTomorrow is the target variable.
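The number of distinct labels per categorical column determines how many columns one-hot encoding will later produce. The following is a small sketch of counting them with `nunique()`, using five rows copied from the categorical preview above. `categorical_sample` is a hypothetical name, and on the real data the call would be `categorical.nunique()`.

```python
import pandas as pd

# Five rows copied from the categorical preview above
categorical_sample = pd.DataFrame({
    "Location": ["Albury"] * 5,
    "WindGustDir": ["W", "WNW", "WSW", "NE", "W"],
    "WindDir9am": ["W", "NNW", "W", "SE", "ENE"],
    "RainToday": ["No"] * 5,
})

# Number of distinct labels per column (NaN is not counted by default)
cardinality = categorical_sample.nunique()
print(cardinality)
```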

Missing values in Categorical Variables

Code
categorical.isna().sum().to_frame('number of null values')
number of null values
Date 0
Location 0
WindGustDir 10326
WindDir9am 10566
WindDir3pm 4228
RainToday 3261
RainTomorrow 3267

Explore Numerical Variables

Code
Numerical = df.select_dtypes(include=['float64','int'])
Numerical.head()
MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustSpeed WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm
0 13.4 22.9 0.6 NaN NaN 44.0 20.0 24.0 71.0 22.0 1007.7 1007.1 8.0 NaN 16.9 21.8
1 7.4 25.1 0.0 NaN NaN 44.0 4.0 22.0 44.0 25.0 1010.6 1007.8 NaN NaN 17.2 24.3
2 12.9 25.7 0.0 NaN NaN 46.0 19.0 26.0 38.0 30.0 1007.6 1008.7 NaN 2.0 21.0 23.2
3 9.2 28.0 0.0 NaN NaN 24.0 11.0 9.0 45.0 16.0 1017.6 1012.8 NaN NaN 18.1 26.5
4 17.5 32.3 1.0 NaN NaN 41.0 7.0 20.0 82.0 33.0 1010.8 1006.0 7.0 8.0 17.8 29.7

Missing values in numerical variables

Code
Numerical.isnull().sum()
MinTemp           1485
MaxTemp           1261
Rainfall          3261
Evaporation      62790
Sunshine         69835
WindGustSpeed    10263
WindSpeed9am      1767
WindSpeed3pm      3062
Humidity9am       2654
Humidity3pm       4507
Pressure9am      15065
Pressure3pm      15028
Cloud9am         55888
Cloud3pm         59358
Temp9am           1767
Temp3pm           3609
dtype: int64
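Raw counts are easier to judge as a share of all rows. The following sketch converts the four largest counts from the output above into fractions of the 145460 rows. The dictionary is copied by hand from that output, and `missing_share` is a hypothetical name.

```python
# Missing-value counts copied from the output above (four largest)
n_rows = 145460
missing_counts = {
    "Sunshine": 69835,
    "Evaporation": 62790,
    "Cloud3pm": 59358,
    "Cloud9am": 55888,
}

# Share of rows with a missing value, per column
missing_share = {col: count / n_rows for col, count in missing_counts.items()}
for col, share in missing_share.items():
    print(f"{col}: {share:.1%}")
```

On the full DataFrame, `df.isna().mean()` gives the same quantity for every column at once.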

Check for duplicated values

Code
df[df.duplicated()]
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RainTomorrow

0 rows × 23 columns

The dataset contains no duplicate rows.

Data Visualization


Date Plot

Code
df_dateplot = df.iloc[-950:,:]
plt.figure(figsize=[20,5])
plt.plot(df_dateplot['Date'],df_dateplot['MinTemp'],color='blue',linewidth=1, label= 'MinTemp')
plt.plot(df_dateplot['Date'],df_dateplot['MaxTemp'],color='red',linewidth=1, label= 'MaxTemp')
plt.fill_between(df_dateplot['Date'],df_dateplot['MinTemp'],df_dateplot['MaxTemp'], facecolor = '#EBF78F')
plt.title('MinTemp vs MaxTemp by Date')
plt.legend(loc='lower left', frameon=False)
plt.show()

  • The plot above shows that minimum and maximum temperatures rise and fall in a yearly cycle.
  • Seasons are reversed between the two hemispheres. Because Australia lies in the Southern Hemisphere, its seasons fall in different months than in the Northern Hemisphere.
  • Summer runs from December to February, autumn from March to May, winter from June to August, and spring from September to November.
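The month-to-season mapping listed above can be written as a lookup table. The following is a sketch only: the derived season values are a hypothetical feature that the notebook does not use, and the example dates are illustrative.

```python
import pandas as pd

# Southern Hemisphere seasons by month, as listed above
SEASON_BY_MONTH = {
    12: "Summer", 1: "Summer", 2: "Summer",
    3: "Autumn", 4: "Autumn", 5: "Autumn",
    6: "Winter", 7: "Winter", 8: "Winter",
    9: "Spring", 10: "Spring", 11: "Spring",
}

dates = pd.to_datetime(pd.Series(["2008-12-01", "2009-04-15", "2009-07-30", "2009-10-05"]))
# Hypothetical derived feature, not part of the original dataset
seasons = dates.dt.month.map(SEASON_BY_MONTH)
print(seasons.tolist())
```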

Visualize Distribution of Numerical Variables

def plot_numerical_distributions(df):
    # Get numerical columns
    numerical_cols = df.select_dtypes(include=['float64']).columns
    
    # Calculate number of rows and columns needed for subplots
    n_plots = len(numerical_cols)
    n_cols = 4  # keeping 4 columns as in original
    n_rows = int(np.ceil(n_plots / n_cols))
    
    # Create subplots
    fig = make_subplots(rows=n_rows, cols=n_cols, 
                       subplot_titles=numerical_cols,
                       vertical_spacing=0.1,
                       horizontal_spacing=0.05)
    
    # Current position tracker
    row = 1
    col = 1
    
    for col_name in numerical_cols:
        # Remove NaN values for this column
        clean_data = df[col_name].dropna()
        
        if len(clean_data) > 0:  # Only create plot if we have non-NaN values
            # Create a density-normalized histogram
            hist = go.Histogram(x=clean_data, 
                              name=col_name,
                              nbinsx=30,
                              histnorm='probability density')
            
            # Overlay a smoothed density line only if we have enough data points.
            # Note: this interpolates histogram heights; it is not a true KDE.
            if len(clean_data) > 1:
                kde_points = np.linspace(clean_data.min(), clean_data.max(), 100)
                kde, bin_edges = np.histogram(clean_data, bins=30, density=True)
                # Histogram heights belong to the bin centers
                bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
                
                # Add density line
                kde_line = go.Scatter(x=kde_points, 
                                    y=np.interp(kde_points, bin_centers, kde),
                                    name=f'{col_name}_kde',
                                    line=dict(color='red'))
                
                # Add traces to subplot
                fig.add_trace(hist, row=row, col=col)
                fig.add_trace(kde_line, row=row, col=col)
            else:
                # If not enough data points for KDE, just add histogram
                fig.add_trace(hist, row=row, col=col)
        
        # Update position for next plot
        if col == n_cols:
            col = 1
            row += 1
        else:
            col += 1
    
    # Update layout
    fig.update_layout(
        height=500 * n_rows,  # Adjust height based on number of rows
        width=1600,          # Fixed width
        showlegend=False,
        title_text="Distribution of Numerical Variables"
    )
    
    # Update axes labels
    fig.update_xaxes(title_text="Value")
    fig.update_yaxes(title_text="Density")
    
    return fig

numeric_dist_fig = plot_numerical_distributions(df)
numeric_dist_fig.show()
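The red line in the function above is derived from histogram heights. For comparison, the following is a sketch of an actual Gaussian kernel density estimate written with NumPy only. The function name `gaussian_kde_1d`, the use of Silverman's rule-of-thumb bandwidth, and the synthetic sample are all assumptions for illustration, not part of the original notebook. The sketch also assumes the sample has nonzero spread.

```python
import numpy as np

def gaussian_kde_1d(samples, grid):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    # Silverman's rule-of-thumb bandwidth
    iqr = np.subtract(*np.percentile(samples, [75, 25]))
    spread = min(samples.std(ddof=1), iqr / 1.34)
    bandwidth = 0.9 * spread * n ** (-1 / 5)
    # Sum one Gaussian bump per sample, evaluated at every grid point
    z = (grid[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
values = rng.normal(loc=20.0, scale=5.0, size=500)  # synthetic stand-in for a temperature column
grid = np.linspace(values.min() - 15, values.max() + 15, 400)
density = gaussian_kde_1d(values, grid)

# Trapezoidal integration of the estimated density over the grid
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))
print(round(area, 3))
```

A valid density integrates to roughly 1 over a grid that covers the data, which is a quick sanity check for any such curve.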

Visualize frequency distribution of RainTomorrow variable

Code
value_counts = df['RainTomorrow'].value_counts(dropna=False)

distribution = value_counts.reset_index()
distribution.columns = ['RainTomorrow', 'Count']

distribution['RainTomorrow'] = distribution['RainTomorrow'].fillna('Null')

bar_chart = px.bar(
    distribution,
    x='RainTomorrow',
    y='Count',
    title='Distribution of the RainTomorrow variable',
    labels={'RainTomorrow': 'Rain Tomorrow', 'Count': 'Count'},
    text='Count'
)
bar_chart.update_traces(textposition='outside')

pie_chart = px.pie(
    distribution,
    names='RainTomorrow',
    values='Count',
    title='Distribution of the RainTomorrow variable'
)

bar_chart.show()
pie_chart.show()
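When one class dominates, the share of the majority class is the accuracy a model would reach by always predicting that class, which makes it a useful reference point. The following is a sketch on toy labels. The values are illustrative and do not reproduce the real class counts.

```python
import pandas as pd

# Toy labels standing in for df['RainTomorrow']; not the real counts
labels = pd.Series(["No", "No", "No", "Yes", "No", "Yes", "No", "No", None, "No"])

# Class shares among the observed (non-missing) labels
shares = labels.value_counts(normalize=True)
majority_class = shares.idxmax()
baseline_accuracy = shares.max()
print(majority_class, baseline_accuracy)
```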

Model Building


Preprocessing

Split the data into features (X) and target (y)

Code
X = df.drop(columns=['Date', 'RainTomorrow'])
y = df['RainTomorrow']

display("X")
display(X.head())

display("Y")
display(y.head())
'X'
Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am WindDir3pm ... WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday
0 Albury 13.4 22.9 0.6 NaN NaN W 44.0 W WNW ... 24.0 71.0 22.0 1007.7 1007.1 8.0 NaN 16.9 21.8 No
1 Albury 7.4 25.1 0.0 NaN NaN WNW 44.0 NNW WSW ... 22.0 44.0 25.0 1010.6 1007.8 NaN NaN 17.2 24.3 No
2 Albury 12.9 25.7 0.0 NaN NaN WSW 46.0 W WSW ... 26.0 38.0 30.0 1007.6 1008.7 NaN 2.0 21.0 23.2 No
3 Albury 9.2 28.0 0.0 NaN NaN NE 24.0 SE E ... 9.0 45.0 16.0 1017.6 1012.8 NaN NaN 18.1 26.5 No
4 Albury 17.5 32.3 1.0 NaN NaN W 41.0 ENE NW ... 20.0 82.0 33.0 1010.8 1006.0 7.0 8.0 17.8 29.7 No

5 rows × 21 columns

'Y'
0    No
1    No
2    No
3    No
4    No
Name: RainTomorrow, dtype: object

Split the data into training and testing sets

Code
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 37)

X_train
Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am WindDir3pm ... WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday
22861 NorfolkIsland 19.8 26.6 0.0 6.0 11.1 SW 26.0 WNW SSW ... 19.0 74.0 74.0 1012.4 1011.5 1.0 3.0 24.7 26.2 No
92291 GoldCoast 18.4 25.5 0.0 NaN NaN NE 33.0 NW NE ... 26.0 61.0 61.0 1019.2 1015.5 NaN NaN 21.0 23.7 No
84611 Brisbane 23.6 29.1 0.0 6.6 3.2 SE 43.0 SE SE ... 13.0 67.0 73.0 1021.3 1020.8 7.0 7.0 25.9 24.4 No
102633 Nuriootpa 10.8 23.4 0.0 9.0 11.0 WSW 41.0 SSE WSW ... 6.0 64.0 30.0 1016.7 1014.7 2.0 3.0 15.3 23.3 No
16017 Newcastle 4.2 19.9 0.0 NaN NaN NaN NaN NW NaN ... 0.0 78.0 42.0 NaN NaN 0.0 0.0 9.5 19.0 No
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
108888 Albany 12.0 17.1 0.0 3.4 3.3 NaN NaN W N ... 20.0 78.0 79.0 1025.4 1024.5 8.0 5.0 14.5 16.0 No
85546 Brisbane 13.9 28.7 0.0 6.8 12.0 NE 22.0 N NNE ... 11.0 45.0 46.0 1020.9 1015.8 0.0 0.0 24.1 26.5 No
104899 Nuriootpa 8.5 13.4 4.4 3.4 2.4 NW 43.0 NNW N ... 11.0 77.0 91.0 1006.1 1001.4 4.0 8.0 10.3 9.8 Yes
120483 PerthAirport 15.8 26.3 0.0 9.2 13.0 SW 52.0 S SW ... 28.0 56.0 37.0 1010.1 1010.5 1.0 1.0 21.9 24.8 No
20843 NorahHead 15.9 19.1 0.0 NaN NaN SSW 65.0 SW SSW ... 44.0 58.0 72.0 1010.9 1009.9 NaN NaN 17.6 18.2 No

116368 rows × 21 columns
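With an imbalanced target, a plain random split can leave the two sets with slightly different class ratios. The following is a sketch of the `stratify` option of `train_test_split` on synthetic labels. It assumes the target has no missing values at split time, which is not the case for RainTomorrow at this point in the notebook, so treat it as an illustration rather than a drop-in change.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 80% class 0, 20% class 1 (illustrative only)
y_demo = np.array([0] * 800 + [1] * 200)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=37, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```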

Separate numerical and categorical features

Code
numerical_cols = X.select_dtypes(include=['float64']).columns
categorical_cols = X.select_dtypes(include=['object', 'category']).columns

display(numerical_cols)
display(categorical_cols)
Index(['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine',
       'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am',
       'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm',
       'Temp9am', 'Temp3pm'],
      dtype='object')
Index(['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday'], dtype='object')

Impute missing values in numerical features

Code
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

num_imputer = SimpleImputer(strategy='median')
X_train_num = pd.DataFrame(num_imputer.fit_transform(X_train[numerical_cols]), columns=numerical_cols)
X_test_num = pd.DataFrame(num_imputer.transform(X_test[numerical_cols]), columns=numerical_cols)

X_train_num[numerical_cols].isnull().sum()
MinTemp          0
MaxTemp          0
Rainfall         0
Evaporation      0
Sunshine         0
WindGustSpeed    0
WindSpeed9am     0
WindSpeed3pm     0
Humidity9am      0
Humidity3pm      0
Pressure9am      0
Pressure3pm      0
Cloud9am         0
Cloud3pm         0
Temp9am          0
Temp3pm          0
dtype: int64
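The imputer is fitted on the training set only, so the test set is filled with medians learned from training data. The following sketch on tiny illustrative arrays (not taken from the dataset) shows that behavior.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Tiny illustrative arrays, not taken from the dataset
train = np.array([[1.0], [2.0], [np.nan], [9.0]])
test = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)   # learns the median of the observed values 1, 2, 9
test_filled = imputer.transform(test)         # reuses the training median for the test gap

print(imputer.statistics_)
print(test_filled.ravel())
```

Fitting on the training set alone keeps information from the test set out of the preprocessing step.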

Impute missing values in categorical features

Code
cat_imputer = SimpleImputer(strategy='most_frequent')
X_train_cat_imputed = pd.DataFrame(cat_imputer.fit_transform(X_train[categorical_cols]), columns=categorical_cols)
X_test_cat_imputed = pd.DataFrame(cat_imputer.transform(X_test[categorical_cols]), columns=categorical_cols)

X_train_cat_imputed[categorical_cols].isnull().sum()
Location       0
WindGustDir    0
WindDir9am     0
WindDir3pm     0
RainToday      0
dtype: int64

Encode categorical features using OneHotEncoder

Code
onehot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
X_train_cat_encoded = pd.DataFrame(onehot_encoder.fit_transform(X_train_cat_imputed),
                                   columns=onehot_encoder.get_feature_names_out(categorical_cols))
X_test_cat_encoded = pd.DataFrame(onehot_encoder.transform(X_test_cat_imputed),
                                  columns=onehot_encoder.get_feature_names_out(categorical_cols))

X_train_cat_encoded
Location_Adelaide Location_Albany Location_Albury Location_AliceSprings Location_BadgerysCreek Location_Ballarat Location_Bendigo Location_Brisbane Location_Cairns Location_Canberra ... WindDir3pm_S WindDir3pm_SE WindDir3pm_SSE WindDir3pm_SSW WindDir3pm_SW WindDir3pm_W WindDir3pm_WNW WindDir3pm_WSW RainToday_No RainToday_Yes
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
116363 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
116364 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
116365 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
116366 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
116367 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0

116368 rows × 99 columns

Combine the preprocessed features

Code
X_train_processed = pd.concat([X_train_num, X_train_cat_encoded], axis=1)
X_test_processed = pd.concat([X_test_num, X_test_cat_encoded], axis=1)

X_train_processed
MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustSpeed WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm ... WindDir3pm_S WindDir3pm_SE WindDir3pm_SSE WindDir3pm_SSW WindDir3pm_SW WindDir3pm_W WindDir3pm_WNW WindDir3pm_WSW RainToday_No RainToday_Yes
0 19.8 26.6 0.0 6.0 11.1 26.0 11.0 19.0 74.0 74.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
1 18.4 25.5 0.0 4.8 8.4 33.0 9.0 26.0 61.0 61.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2 23.6 29.1 0.0 6.6 3.2 43.0 13.0 13.0 67.0 73.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
3 10.8 23.4 0.0 9.0 11.0 41.0 15.0 6.0 64.0 30.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
4 4.2 19.9 0.0 4.8 8.4 39.0 2.0 0.0 78.0 42.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
116363 12.0 17.1 0.0 3.4 3.3 39.0 4.0 20.0 78.0 79.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
116364 13.9 28.7 0.0 6.8 12.0 22.0 6.0 11.0 45.0 46.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
116365 8.5 13.4 4.4 3.4 2.4 43.0 15.0 11.0 77.0 91.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
116366 15.8 26.3 0.0 9.2 13.0 52.0 28.0 28.0 56.0 37.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
116367 15.9 19.1 0.0 4.8 8.4 65.0 24.0 44.0 58.0 72.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0

116368 rows × 115 columns
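`pd.concat(..., axis=1)` aligns rows by index label. The concatenation above works because both parts were rebuilt from NumPy arrays and therefore share a fresh 0..n-1 index. The following sketch, with illustrative values and hypothetical frame names, shows what happens when one part still carries the shuffled row labels.

```python
import pandas as pd

# A shuffled split keeps the original row labels (illustrative values)
numeric_part = pd.DataFrame({"MinTemp": [19.8, 18.4]}, index=[22861, 92291])
# Rebuilding a frame from a NumPy array yields a fresh 0..n-1 index
encoded_part = pd.DataFrame({"RainToday_Yes": [0.0, 0.0]})

misaligned = pd.concat([numeric_part, encoded_part], axis=1)
aligned = pd.concat([numeric_part.reset_index(drop=True), encoded_part], axis=1)

print(misaligned.shape, aligned.shape)
```

The misaligned result has extra rows filled with NaN, so checking the shape after concatenation is a cheap safeguard.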

Impute missing values in the target variable

Code
target_imputer = SimpleImputer(strategy='most_frequent')
y_train_imputed = target_imputer.fit_transform(y_train.values.reshape(-1, 1)).flatten()  # Fit to train
y_test_imputed = target_imputer.transform(y_test.values.reshape(-1, 1)).flatten()

np.unique(y_train_imputed)
array(['No', 'Yes'], dtype=object)
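Imputing the target assigns the majority label to rows whose outcome was never observed, which may bias both training and evaluation. A common alternative, not used in this notebook, is to drop those rows before splitting. The following is a sketch on a toy frame with illustrative values and hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame with one unlabeled row (illustrative values)
toy = pd.DataFrame({
    "Humidity3pm": [22.0, 25.0, 30.0, 16.0],
    "RainTomorrow": ["No", np.nan, "Yes", "No"],
})

# Keep only rows whose label was actually observed
labeled = toy.dropna(subset=["RainTomorrow"])
X_labeled = labeled.drop(columns=["RainTomorrow"])
y_labeled = labeled["RainTomorrow"]
print(len(toy), len(labeled))
```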

Encode the target variable using LabelEncoder

Code
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train_imputed)
y_test_encoded = label_encoder.transform(y_test_imputed)

y_train_encoded
array([0, 0, 1, ..., 1, 0, 0], shape=(116368,))
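`LabelEncoder` sorts the class labels before assigning integers, so "No" becomes 0 and "Yes" becomes 1. The following sketch on a few illustrative labels also shows how to map integer predictions back to the original strings.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoded = encoder.fit_transform(np.array(["No", "No", "Yes", "No"]))

# classes_ is sorted, so "No" -> 0 and "Yes" -> 1
print(encoder.classes_, encoded)

# inverse_transform maps integer predictions back to the original labels
decoded = encoder.inverse_transform(np.array([0, 1, 1]))
print(decoded)
```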

Model Training and Evaluation

Train the model using KNN

Code
knn = KNeighborsClassifier()
knn.fit(X_train_processed, y_train_encoded)
y_pred_enc = knn.predict(X_test_processed)
y_pred_enc
array([0, 0, 0, ..., 0, 0, 0], shape=(29092,))
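KNN relies on distances, so features measured on large scales (pressure near 1000, for example) can outweigh 0/1 one-hot columns. The model above is trained on unscaled features. The following is a sketch, on synthetic data only, of how a `StandardScaler` in a pipeline can change the outcome. The data-generating setup is an assumption chosen to make the effect visible, and it says nothing about how much scaling would help on the weather data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(37)
n = 600
y_demo = rng.integers(0, 2, size=n)
# Synthetic features: one informative on a small scale, one pure noise on a large scale
informative = y_demo * 2.0 + rng.normal(0.0, 0.5, size=n)
noise = rng.normal(0.0, 1000.0, size=n)
X_demo = np.column_stack([informative, noise])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=37)

unscaled = KNeighborsClassifier().fit(X_tr, y_tr)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)

acc_unscaled = unscaled.score(X_te, y_te)
acc_scaled = scaled.score(X_te, y_te)
print(acc_unscaled, acc_scaled)
```

Putting the scaler inside the pipeline means it is fitted on the training data only, in line with how the imputers are used in this notebook.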
Results

Code
from sklearn.metrics import (accuracy_score, confusion_matrix, roc_auc_score,
                             ConfusionMatrixDisplay, RocCurveDisplay)

# Hard class predictions are used for accuracy and the confusion matrix
accuracy = accuracy_score(y_test_encoded, y_pred_enc)
cm = confusion_matrix(y_test_encoded, y_pred_enc)

# ROC analysis needs scores, so use the predicted probability of class 1 ("Yes")
y_score = knn.predict_proba(X_test_processed)[:, 1]
roc_auc = roc_auc_score(y_test_encoded, y_score)

display("Accuracy: " + str(accuracy))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=label_encoder.classes_)
disp.plot(cmap=plt.cm.Blues)

roc_disp = RocCurveDisplay.from_predictions(y_test_encoded, y_score)
plt.show()
'Accuracy: 0.8399216279389523'
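Accuracy alone can hide weak performance on the minority class. The following sketch shows how precision, recall, and F1 follow from confusion-matrix counts. The four counts are hypothetical and are not the notebook's actual results.

```python
# Hypothetical confusion-matrix counts, NOT the notebook's actual results
tn, fp, fn, tp = 850, 50, 110, 90

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)        # of predicted rainy days, how many were rainy
recall = tp / (tp + fn)           # of actual rainy days, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

On the real predictions, `classification_report(y_test_encoded, y_pred_enc)` from `sklearn.metrics` prints these metrics per class.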

Tune KNN by setting hyperparameters

Code
knn = KNeighborsClassifier(n_neighbors=4, metric="manhattan")
knn.fit(X_train_processed, y_train_encoded)
y_pred_enc = knn.predict(X_test_processed)
accuracy_score(y_test_encoded, y_pred_enc)
0.8418121820431734
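Choosing hyperparameters by comparing test-set accuracy tends to give an optimistic estimate. A more systematic option is a cross-validated search on the training data. The following is a sketch with `GridSearchCV` on a small synthetic problem. The search space is hypothetical and the values are illustrative, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic problem standing in for the preprocessed weather features
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=37)

# Hypothetical search space; the values are illustrative, not recommendations
param_grid = {
    "n_neighbors": [3, 4, 5, 7],
    "metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

On the real data the search would be fitted on `X_train_processed` and `y_train_encoded`, leaving the test set for a single final evaluation.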

Results

Code
from sklearn.metrics import (accuracy_score, confusion_matrix, roc_auc_score,
                             ConfusionMatrixDisplay, RocCurveDisplay)

# Hard class predictions are used for accuracy and the confusion matrix
accuracy = accuracy_score(y_test_encoded, y_pred_enc)
cm = confusion_matrix(y_test_encoded, y_pred_enc)

# ROC analysis needs scores, so use the predicted probability of class 1 ("Yes")
y_score = knn.predict_proba(X_test_processed)[:, 1]
roc_auc = roc_auc_score(y_test_encoded, y_score)

display("Accuracy: " + str(accuracy))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=label_encoder.classes_)
disp.plot(cmap=plt.cm.Blues)

roc_disp = RocCurveDisplay.from_predictions(y_test_encoded, y_score)
plt.show()
'Accuracy: 0.8418121820431734'